Critical Vulnerabilities in PDF Generation

Hi Everyone,

Today, I want to discuss an advanced topic: PDF generation in web applications and the critical vulnerabilities associated with it. Many of you will already know this class of vulnerability, but it is worth walking through in detail.

Most web applications provide a PDF generation feature, commonly used for invoices or reports, that incorporates dynamic user input. In this post we will discuss the misconfigurations and vulnerabilities that can lead to critical security issues. The root cause is HTML injection: user input is processed by the PDF generation library as markup.

Let's talk about PDF!

PDF (Portable Document Format) is a widely used format designed for platform-independent document display. Many web applications incorporate PDF generation capabilities, typically through external libraries or plugins.

However, vulnerabilities can arise due to misconfigurations, insufficient security settings, or outdated versions of these libraries, often allowing attackers to exploit unsanitized malicious input.

PDF generation libraries commonly used in web applications include TCPDF, wkhtmltopdf, and mPDF, all of which appear in the examples in this post.

Web applications often need to control the layout of generated PDF files, so these libraries take HTML as input and use it to produce the final PDF. This enables the application to manage the PDF’s design through CSS within the HTML. These libraries operate by parsing the HTML, rendering it, and then converting it into a PDF.

Example: TCPDF

TCPDF is a popular open-source PHP library used to generate PDF documents programmatically. It is known for its ability to convert HTML and CSS into a PDF file without requiring any external extensions. Below is an overview of how TCPDF works, followed by some example code to illustrate its usage.

How TCPDF Works:

Initialize the PDF Document: TCPDF allows you to create a new PDF instance, where you can set properties like page size, margins, orientation, etc.
Adding Content: You can add text, HTML, images, tables, and other elements to the PDF using different functions.
Rendering HTML to PDF: TCPDF can take HTML and CSS code and render it as a styled PDF.
Output PDF: Once all the content has been added, the library provides methods to save the PDF to a file, force a download, or display it directly in the browser.

Example Code:

Since TCPDF takes HTML as input to generate PDF files, an application that includes untrusted user input in that HTML without proper sanitization is exposed to HTML injection. Attackers can inject malicious HTML, which may result in HTML injection or XSS.

// Vulnerable: unsanitized user input is concatenated into the HTML
$html = '<h1>' . $_GET['title'] . '</h1>';
$pdf->writeHTML($html);

If an attacker passes a script tag in the title parameter (<script>alert('XSS')</script>), the generated PDF could contain harmful code. Input that is not sanitized is rendered directly into the PDF or, where the library or viewer supports it, executed.
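One way to close this hole is to escape the value before it reaches the template. The sketch below shows the idea in Python rather than PHP; `build_title_html` is a hypothetical helper for illustration and not part of TCPDF.

```python
import html


def build_title_html(title: str) -> str:
    """Wrap an untrusted title in an <h1>, escaping HTML metacharacters first."""
    # html.escape turns <, >, & and quotes into entities, so the PDF
    # library receives plain text rather than markup.
    return "<h1>" + html.escape(title, quote=True) + "</h1>"


payload = "<script>alert('XSS')</script>"
print(build_title_html(payload))
# prints: <h1>&lt;script&gt;alert(&#x27;XSS&#x27;)&lt;/script&gt;</h1>
```

In the PHP snippet above, the equivalent step would be wrapping `$_GET['title']` in `htmlspecialchars()` before concatenation.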

Another example: wkhtmltopdf

Wkhtmltopdf is an open-source command-line tool that converts HTML to PDF using the WebKit rendering engine, which is also used in web browsers. It renders HTML pages into PDF with support for CSS, JavaScript, and even images. It is often used in server-side environments to generate PDFs from HTML templates.

Project site and downloads (hosted via GitHub releases): wkhtmltopdf.org

Avoid using wkhtmltopdf with untrusted HTML content. Always sanitize user-provided HTML or JavaScript, as failure to do so may result in a complete server compromise!

For setup and installation details, refer to the online documentation.

How wkhtmltopdf Works:

Input HTML: wkhtmltopdf takes an HTML file or web URL as input. It can also accept inline HTML code.
Rendering: It uses WebKit to render the HTML, including styles and JavaScript, in the same way a web browser would.
Output PDF: Once rendered, the content is converted into a PDF document.

After downloading wkhtmltopdf, we can install it using the following command on Debian-based Linux distributions:

$ sudo dpkg -i wkhtmltox_0.12.6.1-2.bullseye_amd64.deb

Running wkhtmltopdf with the -h option will display the tool's help information:

$ wkhtmltopdf -h

Name:

wkhtmltopdf 0.12.6.1 (with patched qt)

Synopsis:

wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>

<SNIP>

When providing a URL to wkhtmltopdf, it will automatically fetch the website and convert it to a PDF:

$ wkhtmltopdf https://application.com/ thisfile.pdf

Loading pages (1/6)

Counting pages (2/6)

Resolving links (4/6)

Loading headers and footers (5/6)

Printing pages (6/6) Done

By examining the generated PDF, we can still identify the application website, though it has been resized to fit the PDF pages.

Simple HTML to PDF Conversion Example

Here’s how you can use wkhtmltopdf from the command line to convert an HTML file to a PDF.

$ wkhtmltopdf input.html output.pdf

This command takes an HTML file (input.html) and converts it into a PDF (output.pdf).

Additionally, we can supply the tool with a local HTML file to better simulate how a PDF generation library operates within a web application. For instance, consider the following HTML file:

We can now execute wkhtmltopdf on this HTML file to generate a corresponding PDF.

$ wkhtmltopdf ./index.html output.pdf

Loading pages (1/6)

Counting pages (2/6)

Resolving links (4/6)

Loading headers and footers (5/6)

Printing pages (6/6) Done

Source: htb

The wkhtmltopdf tool converts HTML to PDF and can be easily integrated into web applications. It supports modern web technologies, making it well suited to generating rich PDFs from dynamic HTML content in server-side applications.

Here’s a simple real-world example of how a web application might generate a PDF receipt after a user submits a purchase form.

Example: Invoice PDF Generation

1. HTML Form (User Input)

Create a basic HTML form (purchase.html) where users can enter their details to generate a PDF invoice. This can be effortlessly accomplished with a PDF generation library. For instance, we can download an open-source invoice HTML template and use wkhtmltopdf to create a PDF invoice from the HTML code with its custom CSS. The resulting PDF will appear as follows:

Source: htb

We can also analyze the generated PDF files with different tools to identify specific vulnerabilities and misconfigurations.

Most of the libraries mentioned above add metadata to the files they produce, and we can use that metadata to identify the library and its version.

To display the metadata we can use exiftool, which shows, among other fields, the Creator of the PDF file. Refer to its documentation for further options.

e.g.:

user$ exiftool invoice.pdf

Creator: wkhtmltopdf 0.12.6.1

We can use this information to look up vulnerabilities that affect this particular version. Another tool, pdfinfo, performs the same task.
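When exiftool or pdfinfo is not at hand, a rough check can be scripted. The following is a minimal sketch, assuming the Creator and Producer entries are stored as plain literal strings in the file; `pdf_tool_metadata` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import re

# Matches "/Creator (...)" or "/Producer (...)" literal strings in raw PDF bytes.
_INFO_RE = re.compile(rb"/(Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)")


def pdf_tool_metadata(data: bytes) -> dict:
    """Return the first Creator/Producer values found in the raw bytes."""
    found = {}
    for key, value in _INFO_RE.findall(data):
        # Keep only the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Illustrative sample, not taken from a real file.
sample = b"<< /Creator (wkhtmltopdf 0.12.6.1) >>"
print(pdf_tool_metadata(sample))
```

Real files may store these strings in UTF-16 or inside compressed objects, in which case this sketch finds nothing or returns raw bytes, and a dedicated tool such as exiftool remains the more reliable option.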

Now let's move to the EXPLOITATION part.

We learned how PDF generation libraries function and how to identify them. After identifying the libraries, we can explore how to exploit the vulnerabilities that arise from misconfigurations. All of these vulnerabilities rely on inserting malicious user-provided content into the PDF generator’s input.

I have already shared a resource about a PDF hacking technique in a LinkedIn post. It covers an alternative method involving PDF uploads and exploitation, which you can review later.

Executing HTML Code

The basic test case to perform here is HTML injection. It occurs when an attacker injects malicious HTML into a web application's PDF generation process. Many PDF generators, such as wkhtmltopdf, TCPDF, or similar libraries, convert HTML input into a PDF. If this input isn't properly sanitized, attackers can exploit it by injecting harmful code.

How HTML Code Injection Happens:

User-Supplied Input: If the PDF generator uses user-provided data (e.g., form inputs or dynamic content) to create a PDF, and that input is not sanitized or validated, malicious HTML or JavaScript can be embedded.
PDF Generation: The PDF generator library processes this malicious input and renders it in the resulting PDF.
Exploitation: When the PDF is opened, the embedded HTML or JavaScript could be executed, potentially leading to attacks like XSS (Cross-Site Scripting) or other vulnerabilities.

Example Scenario:

A web application allows users to input HTML code to generate reports or invoices in PDF format. If the input isn’t sanitized, an attacker could submit the following:

<h1>test2</h1> <script>alert('PDF Exploit!')</script>

This code would be processed by the PDF generator, and if JavaScript is allowed in the resulting PDF, it would execute when opened, displaying an alert message. This is a simple example, but more complex exploits could involve stealing sensitive data or compromising the system.
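To spot this kind of input before it reaches the generator, an application could parse the submitted text and flag tags that trigger script execution or resource loading. The sketch below uses Python's standard `html.parser`; the `RISKY_TAGS` set and the `risky_tags` helper are assumptions chosen for illustration. A blocklist like this is useful for logging and alerting, but it is not a substitute for escaping or an allowlist-based sanitizer.

```python
from html.parser import HTMLParser

# Tags discussed in this post that can run script or fetch resources.
RISKY_TAGS = {"script", "iframe", "link", "img", "object", "embed", "annotation"}


class TagCollector(HTMLParser):
    """Collects the names of risky start tags seen in the input."""

    def __init__(self):
        super().__init__()
        self.risky = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in RISKY_TAGS:
            self.risky.append(tag)


def risky_tags(user_input: str) -> list:
    parser = TagCollector()
    parser.feed(user_input)
    parser.close()
    return parser.risky


print(risky_tags("<h1>test2</h1> <script>alert('PDF Exploit!')</script>"))
# prints: ['script']
```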

In the same way, we can inject JavaScript code into the PDF.

Executing JavaScript Code

Executing JavaScript code means running scripts inside a JavaScript runtime. Usually that runtime is a web browser; here it is the PDF generator itself.

Many PDF generation libraries like wkhtmltopdf or TCPDF allow HTML input and may execute embedded JavaScript within that input when generating the PDF.

JavaScript execution can occur in two primary ways:

Client-side Execution: This occurs in the user’s web browser.
Server-side Execution: This occurs in a server environment using platforms like Node.js.

When the PDF generation library processes HTML input, it may execute the injected JavaScript code. Moreover, since the PDF generation library operates on the server, the payload would also be executed on the server, making this type of vulnerability known as Server-Side XSS.

JavaScript execution in PDFs refers to the ability to embed and run JavaScript code within a PDF document, which opens up more attack vectors. Essentially, we are looking for malicious user input that flows directly into the PDF file: the PDF generation library renders the HTML input and executes the injected JavaScript code.

How JavaScript Code Execution Happens in PDF Generation

User Input in HTML: A web application might allow users to input content, such as forms or comments, which is later converted to PDF. If the user input includes JavaScript and is not sanitized, it can be embedded in the PDF document.
PDF Generation: The PDF generator (such as wkhtmltopdf) processes the input, converting the HTML and potentially the JavaScript into a PDF. If JavaScript execution isn’t disabled, the resulting PDF may include interactive JavaScript elements.
Execution on PDF Open: If the PDF reader supports JavaScript execution (as Adobe Reader does), opening the PDF may trigger the embedded JavaScript, which could lead to malicious actions, such as displaying alerts, stealing data, or even compromising the system.

Example of JavaScript Injection in PDF Generation

Suppose a web application allows users to input text, which is then embedded into an HTML template for generating a PDF. An attacker could input the following malicious script:

If this input is not properly sanitized and the PDF generator (e.g., wkhtmltopdf) processes it, the generated PDF will contain the embedded script. When the file is opened in a vulnerable PDF viewer (like Adobe Reader), the JavaScript will execute and display the alert, or the string PDF Hacked will be reflected in the PDF.

This is a basic cross-site scripting example. As a first real exploit, let's trigger an information disclosure that reveals a file path on the web server. This can be achieved with the following payload:

If you run the above script in a PDF generator that accepts and processes HTML input, the behavior depends on several factors, including how the PDF generator handles JavaScript and whether JavaScript is enabled in the PDF viewer.

The script would attempt to retrieve the current URL or current location of JavaScript (from window.location) and write it directly into the PDF.
For example: file:///var/www/html/banksecret/secret.html
Since the script is running in a server-side environment (where PDF generation occurs), window.location may not work as expected because window.location is typically used in a browser context to get the current URL of the webpage.
If the PDF generator doesn’t handle browser-like environments, it could fail to execute, leading to either no output or an error in the generated PDF.
If JavaScript is disabled in the generator, the script will not be processed; it will either appear as plain text in the PDF or be ignored entirely, depending on the generator's configuration.
When the generated PDF is opened in a viewer that supports JavaScript (e.g., Adobe Reader with JavaScript enabled), the script might execute.
In this case, it will attempt to write the current location of the PDF file (not a web page) into the document. However, most PDF viewers do not allow access to window.location as it’s not a concept within the PDF environment, so it may return null or nothing at all.
The script may also fail entirely if the viewer does not support window.location in the context of a PDF.

Server-Side Request Forgery

Server-Side Request Forgery (SSRF) in PDF Generators occurs when an attacker manipulates a PDF generator to make unauthorized requests on behalf of the server. This can lead to information disclosure, internal network scanning, or even compromise of internal services that are otherwise inaccessible.

SSRF vulnerabilities often arise in systems where external content (e.g., URLs or resources) is dynamically included in generated PDFs. Attackers exploit these vulnerabilities by injecting malicious URLs, tricking the server into fetching unintended resources.

To identify SSRF, we can inject different HTML tags that compel the server to initiate an HTTP request.

In a similar way, we can inject a stylesheet by using the link tag:

Typically, for images and stylesheets, the response does not appear in the generated PDF, resulting in a blind SSRF vulnerability that limits our ability to exploit it. However, depending on the (mis)configuration of the PDF generation library, we can inject other HTML elements that can initiate a request and cause the server to display the response. One such example is an iframe:

Injecting the three payloads and generating a PDF triggers three requests to our collaborator domains, successfully confirming SSRF with all three payloads.

We can verify this by checking the collaborator client and reviewing the output PDF file.

As a result, we have a regular SSRF vulnerability rather than a blind one, which is far more critical as it enables us to exfiltrate data more easily. For example, we can send a request to any internal endpoint and have the response displayed to us. Here’s how we can leak data from an internal API:

SSRF via External Resource Inclusion

If the PDF generator fetches external resources (such as images, stylesheets, or scripts) from user-provided URLs, an attacker can supply a malicious URL pointing to internal services.
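On the defensive side, an application can check resource URLs before the HTML is handed to the generator. The following is a minimal sketch, assuming resources only ever need to come from a small set of known hosts; `ALLOWED_HOSTS` and `is_resource_url_allowed` are hypothetical names, and `cdn.example.com` is a placeholder.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts the PDF templates may load resources from.
ALLOWED_HOSTS = {"cdn.example.com"}


def is_resource_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host is explicitly allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are rejected.
        return False
    if parts.scheme not in ("http", "https"):
        return False  # blocks file://, ftp://, and similar schemes
    # hostname is lowercased by urlsplit; IP literals and internal names
    # fail because they are not in the allowlist.
    return parts.hostname in ALLOWED_HOSTS


print(is_resource_url_allowed("https://cdn.example.com/logo.png"))
print(is_resource_url_allowed("http://127.0.0.1:8080/api/users"))
```

A check like this only sees the URL as written. It does not follow redirects, so it should be paired with network-level egress restrictions on the host that runs the generator.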

The generated PDF includes the response from the internal API, potentially exposing sensitive information that would otherwise be inaccessible from the outside:

Source htb labs

The attacker injects an iframe tag with resource URL pointing to an internal service.
The PDF generator fetches the resource.
The server responds with potentially sensitive information, which is embedded in the PDF, leaking internal data.

Local File Inclusion

Local File Inclusion (LFI) in a PDF generation web application occurs when an attacker can manipulate the input to the PDF generator to include or read files from the server’s file system. This vulnerability often arises when the web application does not properly sanitize user input, allowing the attacker to reference local files on the server.

There are several HTML elements we can attempt to inject in order to read local files on the server.

By executing JavaScript, if the server processes our injected script, we can utilize XMLHttpRequests and the file protocol to read local files, leading to a payload like this:

By injecting this JavaScript code, we can view the contents of the passwd file in the generated PDF:

However, this method can be impractical for certain files, as extracting data from the PDF may corrupt it. For example, syntax might break if we attempt to exfiltrate an SSH key. Additionally, files with binary data cannot be extracted in this manner. Therefore, we should base64-encode the file using the btoa function before including it in the PDF:

However, this results in a single long line that may be truncated if it doesn’t fit on the PDF page, as the library usually doesn’t insert line breaks.

Source htb

To work around this, we can modify the payload to add a line break every 100 characters so that the data fits on the PDF page.

After making these changes, we can retrieve the file successfully. The base64-encoded data can now be copied and decoded using any tool that ignores line breaks in the input.
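The chunking and the decoding step can be illustrated outside the payload. The sketch below reproduces the same logic in Python, assuming a width of 100 characters as described above; `wrap_base64` and `unwrap_base64` are hypothetical helper names, and the sample bytes stand in for a binary file.

```python
import base64


def wrap_base64(data: bytes, width: int = 100) -> str:
    """Base64-encode data and insert a line break every `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


def unwrap_base64(text: str) -> bytes:
    """Decode base64 text copied from the PDF, ignoring any whitespace."""
    return base64.b64decode("".join(text.split()))


# Stand-in for a file containing arbitrary binary data.
original = bytes(range(256)) * 2
wrapped = wrap_base64(original)

print(max(len(line) for line in wrapped.splitlines()))  # no line exceeds 100
print(unwrap_base64(wrapped) == original)               # round trip is lossless
```

Stripping all whitespace before decoding also handles stray spaces that copy and paste from a PDF viewer sometimes introduces.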

In some cases the backend does not execute our injected JavaScript code, and we have to rely on HTML tags alone to display local files.

Some payloads are below:

However, in our test environment, this only results in an empty iframe being displayed.

To display the contents of a file like /etc/passwd, you can use a different approach: point the iframe's src at a server you control, which then redirects the PDF generator to the local file. Here's how you can do it:

Host a small application on your own machine with the code below:

<?php header('Location: file://' . $_GET['url']); ?>

Then we can inject the code below into the application to get a successful result:

After this we will get the below output.

Source htb

You can try more methods for LFI exploitation in PDF generators. Another interesting method uses PDF annotations.

PDF annotations are elements like comments, highlights, and attachments that can be added to a PDF. They can be used to include additional data or modify the document’s behavior.

If the application is using mPDF library for PDF Generators, it supports annotations via the

We can use annotations to append files to a generated PDF by injecting a payload such as the following:

Examining the generated PDF file, we see an annotation with an attached file. Clicking on the attachment reveals the /etc/passwd file.

Source htb

Check the mPDF GitHub repository for any security updates related to annotations or content handling.

A few other libraries also support annotations; check their documentation for details.

Mitigations:

Input Validation & Sanitization: Validate and sanitize user inputs, such as HTML content and file paths.
Avoid File Path Exposure: Restrict access to sensitive file paths and use a safe directory for uploads.
Disable JavaScript: Strip or disable JavaScript in user inputs.
Secure Libraries: Keep PDF libraries updated and configure them securely.
Access Controls: Restrict resource access and run PDF generation in a sandboxed environment.
Mitigate SSRF: Block unauthorized outbound requests and validate URLs.
Monitor & Log: Track access to sensitive resources and set up alerts for anomalies.
Developer Training: Educate developers on secure coding practices and conduct code reviews.
Secure Defaults: Use libraries with secure configuration defaults.
Patch Management: Regularly update software and apply security patches.
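As one concrete illustration of the "Disable JavaScript" and "Secure Libraries" points, the sketch below builds a wkhtmltopdf command line with script execution and local file access turned off. It assumes wkhtmltopdf is installed and on the PATH; `wkhtmltopdf_command` is a hypothetical helper, and the available options should be confirmed against the help output of the installed version.

```python
def wkhtmltopdf_command(input_html: str, output_pdf: str) -> list:
    """Build an argument list for a locked-down wkhtmltopdf run."""
    return [
        "wkhtmltopdf",
        "--disable-javascript",         # do not run scripts found in the HTML
        "--disable-local-file-access",  # block file:// reads from the page
        input_html,
        output_pdf,
    ]


cmd = wkhtmltopdf_command("invoice.html", "invoice.pdf")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True, timeout=30)` rather than a shell string keeps user-influenced file names from being interpreted by a shell. These flags reduce the attack surface but do not replace input sanitization.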